Important Links: Project GitHub Repository Paul Adams Presentation Jeff Nguyen Presentation

Project Data Description

In this project, we analyze 3,202 stock price and volume time series traded on the NASDAQ exchange between May 30th and October 30th, 2019. This date range was selected for its distance from significant biological and political disruptions to the markets, which can introduce both artificial seasonality and increased random variation into forecasts. Data were sourced as comma-separated values through the AlphaVantage API.

Data Selection

Because of the time constraints involved with directly analyzing each stock’s realization, we developed a loop to process each file and fit a linear model to each stock. Where the slope for price is positive and greater than 0.04, and the price is within an affordable range - between 5 and 50 dollars per share - we select that stock and assess whether the spectral density indicates wandering behavior, based on a peak at zero and no additional peaks thereafter. Because of this wandering, the sample ACFs were also expected to damp exponentially, indicative of non-stationary, potentially trending behavior. This method allowed us to identify seven potentially ideal stocks for signal-plus-noise modeling with positive, profitable trending. From these seven stocks, we selected one to model, chosen based on the stationarity of the noise around its signal.

library(tswge) # for plotts.sample.wge()

files = list.files(path='../Time-Series-Stocks', pattern='*.csv')

for(file in files){
  actualFile <- paste0('../Time-Series-Stocks/',file)
  
  df <- read.csv(actualFile, header=T, strip.white=T)

  # run linear regression to get the signal (average).
  t = seq(1, nrow(df),1)
  fit = lm(df$low~t)
  
  # get the frequency values from the spectral density in the Parzen Window (we want them to wander without season; just trend)
  dbz <- plotts.sample.wge(df$low)[4] # plotting sample plots to see the stocks while they process

  # if the linear coefficient (deterministic signal) for the price is positive (slope > 0.04)
  # and the price is between 5 and 50 dollars (affordable):
  if(fit$coefficients[2] > 0.04 && df$low[nrow(df)] > 5 && df$low[nrow(df)] < 50){
    # if the realization is wandering (spectral density peaks at zero and
    # decreases monotonically thereafter), record the symbol once:
    if(all(diff(dbz$dbz) < 0)){
      write.table(df$symbol[1], './models_aic_less_than_0.csv', append=T)
    }
  }
}

Candlestick chart for Exploratory Data Analysis (EDA)

Because we have high, low, open, and close prices, we wanted to visually inspect the relationships between these across all data points once we selected the stock to model. Through this visual assessment, we identified differing amounts of variation within each price set. As a result, we decided to engineer two new features to capture this variation: the high-to-low and open-to-close spreads (a sketch of their construction follows). These two new features allowed our multivariate modeling to parameterize additional insights for forecast development.
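The feature engineering step is not shown in the original chunks; below is a minimal sketch, assuming the HiLo and OpClo columns used in the later multivariate models are simple price spreads computed after the data is loaded:

# assumed construction of the two engineered spread features
# (column names match their later use in the multivariate models)
df$HiLo  <- df$high - df$low    # intraday trading range
df$OpClo <- df$close - df$open  # net daily movement (sign gives direction)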

urlfile = "https://raw.githubusercontent.com/PaulAdams4361/Time-Series-Project/master/NASDAQ_Daily_ACGL.csv"
df <- read_csv(url(urlfile))
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
df <- data.frame(Date=df$times, coredata(df[,2:5]))

fig <- df %>% plot_ly(x = ~Date, type="candlestick",
          open = ~open, close = ~close,
          high = ~high, low = ~low) 
fig <- fig %>% layout(title = "Candlestick Chart for ACGL")

fig

Candlestick Chart for ACGL

After analyzing the candlestick plot, we decided to use the low price as the target feature of the model. We chose it because when a stock is trending up, the low price reveals this quickly: the moving average will often rise above the low price, especially during stronger uptrending behavior. This further provides insight into potential investment profitability.

Below are functions for data portability throughout the project; they load the source data sets we will use.
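The helper definitions themselves are suppressed in this writeup; a minimal sketch of stock_data(), assuming it simply re-reads the ACGL file from the project repository:

# assumed definition of the data-portability helper used throughout
stock_data <- function() {
  urlfile <- "https://raw.githubusercontent.com/PaulAdams4361/Time-Series-Project/master/NASDAQ_Daily_ACGL.csv"
  read_csv(url(urlfile))
}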

Sample plot of original low price data

plotts.sample.wge(df$low, arlimits=T)

## $autplt
##  [1] 1.0000000 0.9542096 0.8987323 0.8524345 0.8100849 0.7813310 0.7600881
##  [8] 0.7371705 0.7123568 0.6828689 0.6476693 0.6153474 0.5864842 0.5600570
## [15] 0.5282542 0.4955941 0.4660246 0.4343222 0.4026243 0.3675647 0.3278754
## [22] 0.2900614 0.2568547 0.2354177 0.2239847 0.2178967
## 
## $freq
##  [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
## 
## $db
##  [1]  14.2537148   9.9541873   5.8173014  -0.5437278  -0.8235314
##  [6]  -1.3564258   1.4197169  -1.1176687  -7.3533813  -3.8782332
## [11]  -0.5445071  -3.3199576  -5.9250727  -5.4172047  -2.9311702
## [16] -11.1750831  -6.3519213 -18.5926329 -18.7079883  -8.4217804
## [21]  -9.3402655  -9.6710352  -9.7031584  -8.2549030 -14.1830462
## [26] -11.3498316 -12.9743644 -12.1531178 -16.0829501  -9.7837888
## [31] -12.9010407  -9.5737077 -12.8850744 -15.0135304 -10.3417581
## [36] -16.6716427 -13.1487825 -11.0083985 -16.2745553 -18.6304870
## [41] -17.7423184 -13.7665082 -16.0810139 -16.9046379 -13.1224062
## [46] -20.5533807 -18.2518162 -14.7903000 -19.2395748 -19.0265205
## 
## $dbz
##  [1]  10.6247182   9.9050329   8.7000676   7.0111987   4.8698055
##  [6]   2.3951954  -0.1075597  -2.1509930  -3.4458827  -4.1937204
## [11]  -4.7075065  -5.1324715  -5.5423563  -6.0346253  -6.7027509
## [16]  -7.5765092  -8.5979157  -9.6409675 -10.5694523 -11.2993670
## [21] -11.8169224 -12.1628463 -12.4196726 -12.6865263 -13.0293947
## [26] -13.4394270 -13.8345951 -14.1178754 -14.2557593 -14.2960188
## [31] -14.3127453 -14.3564021 -14.4497949 -14.6054534 -14.8345314
## [36] -15.1426627 -15.5243385 -15.9631650 -16.4335783 -16.8977220
## [41] -17.3041080 -17.6055003 -17.7946958 -17.9201902 -18.0549021
## [46] -18.2464892 -18.4904595 -18.7368379 -18.9191710 -18.9861139

Sample plot of once-differenced low price data
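The differenced series dftrans is used below without its construction shown; a minimal sketch, assuming a first difference taken with tswge:

# assumed construction of the once-differenced series (1-B)X_t
dftrans <- artrans.wge(df$low, phi.tr=1)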

plotts.sample.wge(dftrans, arlimits=T)

## $autplt
##  [1]  1.000000000  0.183338113 -0.087288073 -0.237238249 -0.305867805
##  [6] -0.078796894  0.093023749  0.049535162 -0.030050085 -0.018890445
## [11] -0.061602260  0.032022280 -0.006340976  0.013987908 -0.067126571
## [16] -0.018885062  0.053623779  0.087550930  0.039990694 -0.042584390
## [21]  0.046555568  0.002638270 -0.121413055 -0.170734093 -0.108245754
## [26]  0.034041133
## 
## $freq
##  [1] 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505 0.06060606
##  [7] 0.07070707 0.08080808 0.09090909 0.10101010 0.11111111 0.12121212
## [13] 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717 0.18181818
## [19] 0.19191919 0.20202020 0.21212121 0.22222222 0.23232323 0.24242424
## [25] 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929 0.30303030
## [31] 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535 0.36363636
## [37] 0.37373737 0.38383838 0.39393939 0.40404040 0.41414141 0.42424242
## [43] 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747 0.48484848
## [49] 0.49494949
## 
## $db
##  [1] -12.02192461  -3.41880995 -16.37917551  -0.65398900   0.06380267
##  [6] -10.48677972   3.84943304   0.29097292  -0.58134568  -8.41267862
## [11]   7.09269824   3.37965045  -8.18823270   4.07324873   7.41002062
## [16]  -9.96202731   2.43313338   3.66790531   4.67224868  -3.29023869
## [21] -17.80806558  -2.21306663  -3.70191891   1.22399133  -4.29135647
## [26]  -5.57149725  -4.24750077 -10.08378811   0.74404261  -3.35436014
## [31]  -8.42558674   3.90880953  -3.26777818   0.10095129   2.13744677
## [36]  -2.92553361 -10.85748651   4.00245066  -2.70327817 -17.79303064
## [41]  -7.14480011   1.35013762 -12.04069435  -3.97586777   1.44883756
## [46] -14.64216955  -1.60736693  -2.44746992  -8.43751621
## 
## $dbz
##  [1] -4.97475845 -4.07716660 -2.97303042 -1.88591192 -0.90492747
##  [6] -0.04471521  0.70766517  1.36795146  1.94443294  2.43681610
## [11]  2.83966838  3.14613100  3.34924324  3.44076747  3.40932510
## [16]  3.24009052  2.91739090  2.43026934  1.78033525  0.99095904
## [21]  0.11568360 -0.75978064 -1.53098712 -2.10584743 -2.43513704
## [26] -2.51843725 -2.38998876 -2.10596628 -1.73599851 -1.35057797
## [31] -1.00684070 -0.74110380 -0.57038660 -0.49899583 -0.52499087
## [36] -0.64274874 -0.84027113 -1.09295953 -1.35909561 -1.58511918
## [41] -1.72623996 -1.77392848 -1.76698265 -1.77315401 -1.85557632
## [46] -2.04429723 -2.31905385 -2.60453070 -2.79063339

Signal-Plus-Noise Model

In this Signal-Plus-Noise model, we perform a hypothesis test on the linear trend, using OLS to assess whether the trend is possibly deterministic. After positive confirmation from OLS, we tested this with the Cochrane-Orcutt \(AR(1)\)-based hypothesis test, which accounts for serial correlation when determining slope significance. This test also confirmed the signal is a deterministic trend.

Next, we removed the fitted trend line from the data and built a model for the residuals. We then tested the residuals for white noise using the Ljung-Box test, which indicated there is not enough evidence to consider the residuals to be more than white noise. Because of the stationarity of the residuals, we were able to estimate a model using the linear slope plus the variation of the residuals.

Forecasting error was measured in terms of Average Squared Error using the last trading week’s data points, for which there were five. The ASE was 0.1654078.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
# take a sample of the data, analyze
# plotts.sample.wge(df$low, arlimits=T)

####################
# Signal-Plus-Noise
####################
x = df$low
n = length(x)
t = 1:n
fit = lm(x ~ t)

Ordinary Least Squares Estimation of Original Data for Assessing Linear Trend

summary(fit) # OLS suggests a deterministic trend; the slope's p-value is significant, so reject the null hypothesis that the slope is zero
## 
## Call:
## lm(formula = x ~ t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10389 -0.52398 -0.02693  0.34676  1.35981 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.855438   0.123444  282.36   <2e-16 ***
## t            0.072922   0.002122   34.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6126 on 98 degrees of freedom
## Multiple R-squared:  0.9234, Adjusted R-squared:  0.9226 
## F-statistic:  1181 on 1 and 98 DF,  p-value: < 2.2e-16
# deterministic. The null argues that any apparent trend is random and that the realization
# will eventually wander back to approximate its mean.

# Because OLS is not robust to serially correlated errors, we apply the Cochrane-Orcutt test to the
# beta coefficient for time, using an approach that fits an AR(1) model to the residuals:
library(orcutt) # for cochrane.orcutt()
cfit = cochrane.orcutt(fit) # to confirm with Cochrane-Orcutt

Cochrane-Orcutt Hypothesis Test of Original Data for Assessing Linear Trend

summary(cfit) # Cochrane-Orcutt also provides a significant p-value. Based on this, we assume the slope is not equal to zero and
## Call:
## lm(formula = x ~ t)
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 35.0451902  0.3933902  89.085 < 2.2e-16 ***
## t            0.0698107  0.0063528  10.989 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.366 on 97 degrees of freedom
## Multiple R-squared:  0.5546 ,  Adjusted R-squared:  0.55
## F-statistic: 120.8 on 1 and 97 DF,  p-value: < 9.97e-19
## 
## Durbin-Watson statistic 
## (original):    0.39519 , p-value: 1.096e-16
## (transformed): 1.49316 , p-value: 3.732e-03
# therefore, there is a deterministic trend that justifies fitting a signal-plus-noise model instead of an ARMA(p,q). However, an ARMA(p,q)
# will be fitted later for comparison.

#x.z = x - fit$coefficients[1] - fit$coefficients[2]*t # derive residuals

Model Estimate for Noise Component of Signal-Plus-Noise Model

x.z <- fit$residuals # derive residuals (from the regression line)
ar.z <- aic.wge(x.z, p=0:6, type="aic") # find a model to use for approximating the residuals. NOTE: (sigmaHAT_a)^2 = 0.1177843
# ar.z$p is the order p (aic selects p=2 where q=0, as does the bic)
ar.z
## $type
## [1] "aic"
## 
## $value
## [1] -2.078901
## 
## $p
## [1] 2
## 
## $q
## [1] 0
## 
## $phi
## [1]  1.0533533 -0.3193699
## 
## $theta
## [1] 0
## 
## $vara
## [1] 0.1177843

Transforming Price Using Autoregressive Coefficients of Fitted Residuals
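The construction of fitco is not shown; a minimal sketch, assuming both the price and the time index are quasi-differenced with the AR(2) coefficients fitted to the residuals and the trend re-estimated by OLS (the names y.trans and t.trans match the lm call in the output below):

# assumed Cochrane-Orcutt style transformation using the AR(2) coefficients
y.trans <- artrans.wge(x, phi.tr=ar.z$phi, plottr=FALSE)
t.trans <- artrans.wge(t, phi.tr=ar.z$phi, plottr=FALSE)
fitco <- lm(y.trans ~ t.trans)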

summary(fitco) # check to see if the transformed beta coefficient for the slope is still significant
## 
## Call:
## lm(formula = y.trans ~ t.trans)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.97241 -0.18510  0.01497  0.20566  0.70938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.320641   0.074125  125.74   <2e-16 ***
## t.trans     0.070037   0.004634   15.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3452 on 96 degrees of freedom
## Multiple R-squared:  0.7041, Adjusted R-squared:  0.701 
## F-statistic: 228.4 on 1 and 96 DF,  p-value: < 2.2e-16

Exploring Signal-Plus-Noise Model Residuals

The residuals visually appear to be close to white noise. The data points all center around the mean, which no longer appears to depend on time. However, the variance does not appear perfectly constant; at least a few points suggest a covariance structure that may still depend on time.

# Evaluating the residuals after Cochrane-Orcutt:
plotts.sample.wge(fitco$residuals, arlimits=T) # residuals appear to be white noise

## $autplt
##  [1]  1.000000000 -0.023293531  0.036962010 -0.014802296 -0.126405087
##  [6]  0.029796922  0.117072435  0.012129402 -0.043587820 -0.004767952
## [11] -0.097391375  0.041530277 -0.057857414  0.017044841 -0.087083880
## [16] -0.041907254 -0.005373860  0.008676478 -0.047742311 -0.123504966
## [21]  0.021732009 -0.013023861 -0.094545416 -0.117416320 -0.081848196
## [26] -0.012175991
## 
## $freq
##  [1] 0.01020408 0.02040816 0.03061224 0.04081633 0.05102041 0.06122449
##  [7] 0.07142857 0.08163265 0.09183673 0.10204082 0.11224490 0.12244898
## [13] 0.13265306 0.14285714 0.15306122 0.16326531 0.17346939 0.18367347
## [19] 0.19387755 0.20408163 0.21428571 0.22448980 0.23469388 0.24489796
## [25] 0.25510204 0.26530612 0.27551020 0.28571429 0.29591837 0.30612245
## [31] 0.31632653 0.32653061 0.33673469 0.34693878 0.35714286 0.36734694
## [37] 0.37755102 0.38775510 0.39795918 0.40816327 0.41836735 0.42857143
## [43] 0.43877551 0.44897959 0.45918367 0.46938776 0.47959184 0.48979592
## [49] 0.50000000
## 
## $db
##  [1]  -2.6777639   3.8135859  -5.9683163   1.5184256  -0.8803790
##  [6]  -8.3318773   3.2762270  -5.5042949  -2.7101826 -11.0399856
## [11]   4.2098310  -1.1794546 -15.1706294   3.0051947   4.2037907
## [16] -14.7263422   2.6009693  -1.7938045   4.0541794  -2.7999958
## [21] -10.9192077  -2.7720180  -1.0584065  -0.9384616  -7.2725793
## [26]  -8.4434667  -1.6379997  -2.6615413   2.9414994 -20.8707318
## [31]  -0.8912864   0.8280419  -4.9789348   4.5710824   2.0756433
## [36]  -3.3542192  -3.0244091   3.1969985   0.7698999  -3.2064604
## [41]  -1.0718805   1.6877260  -5.0394119   4.6670975   2.0074680
## [46]   0.9394747 -12.0509666  -1.1697708  -1.8475989
## 
## $dbz
##  [1] -0.38282185 -0.28397746 -0.19028322 -0.15763751 -0.20611289
##  [6] -0.31103987 -0.41258549 -0.44164356 -0.35347475 -0.14797764
## [11]  0.13763835  0.45148631  0.74363692  0.97075484  1.09544344
## [16]  1.08682979  0.92409708  0.60192986  0.13647776 -0.42990174
## [21] -1.02867951 -1.57512948 -1.98926204 -2.21734327 -2.23958303
## [26] -2.06472081 -1.72432810 -1.26936595 -0.76147780 -0.25913884
## [31]  0.19279969  0.56517106  0.84040110  1.00740432  1.05960352
## [36]  0.99785065  0.83757497  0.61677589  0.39748573  0.25011264
## [41]  0.21841929  0.28807954  0.39115926  0.44428758  0.38863017
## [46]  0.21366882 -0.03343047 -0.25549264 -0.34559474

Ljung-Box Hypothesis Test for White Noise Variance

There is not enough evidence based on the Ljung-Box test to reject the null hypothesis. Therefore, we fail to reject the null hypothesis and proceed under the assumption that the residuals are white noise.

ljung.wge(fitco$residuals)
## Obs -0.02329353 0.03696201 -0.0148023 -0.1264051 0.02979692 0.1170724 0.0121294 -0.04358782 -0.004767952 -0.09739137 0.04153028 -0.05785741 0.01704484 -0.08708388 -0.04190725 -0.00537386 0.008676478 -0.04774231 -0.123505 0.02173201 -0.01302386 -0.09454542 -0.1174163 -0.0818482
## $test
## [1] "Ljung-Box test"
## 
## $K
## [1] 24
## 
## $chi.square
## [1] 12.52529
## 
## $df
## [1] 24
## 
## $pval
## [1] 0.9733174
# Final Signal-Plus-Noise Model: X_t = 34.855438 + 0.072922*t + Z_t, (sigmaHAT_a)^2 = 0.1177843
# (the intercept and slope coefficients come from summary(lm(x ~ t)))
# (1 - 1.0533533*B + 0.3193699*B^2)*Z_t = a_t; the AR(2) coefficients come from ar.z$phi and (sigmaHAT_a)^2 = 0.1177843

Model Estimated from the Signal-Plus-Noise Residuals: AR(2)

\((1 - 1.0533533B + 0.3193699B^2)Z_{t} = a_{t}, \hat{\sigma}_a^2=0.1177843\)

ar.z
## $type
## [1] "aic"
## 
## $value
## [1] -2.078901
## 
## $p
## [1] 2
## 
## $q
## [1] 0
## 
## $phi
## [1]  1.0533533 -0.3193699
## 
## $theta
## [1] 0
## 
## $vara
## [1] 0.1177843

Estimated Model Sample Plots of Realization, Autocorrelation, and Spectral Density

First, we generated a realization of 100 random data points from the AR(2) model estimated for the residuals, using their estimated variance, added to the linear model’s intercept and slope coefficients.
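The generation chunk itself is suppressed; a minimal sketch of how we assume est_mod was produced (the seed sn=40 is an arbitrary assumption):

# assumed generation: AR(2) noise with the fitted variance, added to the
# fitted linear signal from the OLS trend
z <- gen.arma.wge(100, phi=ar.z$phi, vara=ar.z$vara, sn=40)
est_mod <- fit$coefficients[1] + fit$coefficients[2]*(1:100) + z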

Signal-Plus-Noise Model Comparison

We visualized the estimated realization’s ACF and Spectral Density plots using the tswge package below.

Estimated Model:

plotts.sample.wge(est_mod, arlimits=T)

## $autplt
##  [1] 1.0000000 0.9651366 0.9118145 0.8565512 0.8119889 0.7768475 0.7465706
##  [8] 0.7193186 0.6922062 0.6658128 0.6410263 0.6151013 0.5857824 0.5597746
## [15] 0.5307624 0.5006611 0.4660858 0.4286500 0.3901764 0.3577466 0.3377352
## [22] 0.3232818 0.3066006 0.2867248 0.2640283 0.2377298
## 
## $freq
##  [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
## 
## $db
##  [1]  14.3656799  10.0443945   4.2373256   0.3328415   3.7183504
##  [6]  -2.9293040  -4.4417351   3.3619398  -4.9979624  -6.0664768
## [11]  -9.1976096  -2.4333288  -4.4143009  -9.4646269  -5.4455831
## [16]  -7.8048186 -13.4622979  -6.1605324  -7.4926226 -13.0960169
## [21] -22.9914819 -16.3550408  -9.4486304 -17.4872277 -18.9786086
## [26] -12.5301829 -16.6987460 -15.5008055 -11.5993071 -17.8186056
## [31] -18.1266390 -19.1975188 -15.2008114 -21.5084173 -18.7413264
## [36] -14.4695154 -16.8957406 -18.8504054 -12.8515599 -20.4378658
## [41] -18.0737498 -19.9406019 -17.1370027 -17.0168315 -16.6613265
## [46] -14.2727801 -17.1060521 -17.8422981 -17.1279923 -26.4868302
## 
## $dbz
##  [1]  10.6211100   9.9096304   8.7223550   7.0687089   4.9951979
##  [6]   2.6416904   0.3110044  -1.5988486  -2.9302225  -3.8882665
## [11]  -4.6975926  -5.4166271  -6.0512655  -6.6709255  -7.3787573
## [16]  -8.2217302  -9.1548511 -10.0826599 -10.9421629 -11.7433498
## [21] -12.5212558 -13.2727162 -13.9604085 -14.5674614 -15.1205846
## [26] -15.6523341 -16.1653392 -16.6486865 -17.1148075 -17.5949536
## [31] -18.0909271 -18.5424199 -18.8588992 -18.9968950 -18.9955537
## [36] -18.9338563 -18.8788838 -18.8729126 -18.9358535 -19.0569523
## [41] -19.1857241 -19.2516426 -19.2186147 -19.1245018 -19.0552735
## [46] -19.0825604 -19.2220228 -19.4272086 -19.6101446 -19.6834596

Then, we compared the sample plots to those for the original data, visually comparing to see if the estimated ACF and Spectral Density plots match between the two time series. Because they do, the model is a reasonable fit.

Original Data:

plotts.sample.wge(df$low, arlimits=T) # the estimated model (above) matches the actual data (here) on both sample ACFs and sample spectral

## $autplt
##  [1] 1.0000000 0.9542096 0.8987323 0.8524345 0.8100849 0.7813310 0.7600881
##  [8] 0.7371705 0.7123568 0.6828689 0.6476693 0.6153474 0.5864842 0.5600570
## [15] 0.5282542 0.4955941 0.4660246 0.4343222 0.4026243 0.3675647 0.3278754
## [22] 0.2900614 0.2568547 0.2354177 0.2239847 0.2178967
## 
## $freq
##  [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
## 
## $db
##  [1]  14.2537148   9.9541873   5.8173014  -0.5437278  -0.8235314
##  [6]  -1.3564258   1.4197169  -1.1176687  -7.3533813  -3.8782332
## [11]  -0.5445071  -3.3199576  -5.9250727  -5.4172047  -2.9311702
## [16] -11.1750831  -6.3519213 -18.5926329 -18.7079883  -8.4217804
## [21]  -9.3402655  -9.6710352  -9.7031584  -8.2549030 -14.1830462
## [26] -11.3498316 -12.9743644 -12.1531178 -16.0829501  -9.7837888
## [31] -12.9010407  -9.5737077 -12.8850744 -15.0135304 -10.3417581
## [36] -16.6716427 -13.1487825 -11.0083985 -16.2745553 -18.6304870
## [41] -17.7423184 -13.7665082 -16.0810139 -16.9046379 -13.1224062
## [46] -20.5533807 -18.2518162 -14.7903000 -19.2395748 -19.0265205
## 
## $dbz
##  [1]  10.6247182   9.9050329   8.7000676   7.0111987   4.8698055
##  [6]   2.3951954  -0.1075597  -2.1509930  -3.4458827  -4.1937204
## [11]  -4.7075065  -5.1324715  -5.5423563  -6.0346253  -6.7027509
## [16]  -7.5765092  -8.5979157  -9.6409675 -10.5694523 -11.2993670
## [21] -11.8169224 -12.1628463 -12.4196726 -12.6865263 -13.0293947
## [26] -13.4394270 -13.8345951 -14.1178754 -14.2557593 -14.2960188
## [31] -14.3127453 -14.3564021 -14.4497949 -14.6054534 -14.8345314
## [36] -15.1426627 -15.5243385 -15.9631650 -16.4335783 -16.8977220
## [41] -17.3041080 -17.6055003 -17.7946958 -17.9201902 -18.0549021
## [46] -18.2464892 -18.4904595 -18.7368379 -18.9191710 -18.9861139
# density as well as on partial ACF (below):

We then compared the partial autocorrelation (PACF) plots between the two time series to again visually inspect whether the two approximate each other.

Estimated Model:

pacf(est_mod)

Original Data:

pacf(df$low)

We then repeated this comparison by looping the plots for both realizations twice. However, we suppressed this output to limit excess printout in the documentation. Please refer to the appendix for the actual outputs.
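A minimal sketch of the suppressed loop, assuming it simply regenerates the estimated realization with fresh random draws and re-plots both series:

# assumed form of the suppressed comparison loop
for(i in 1:2){
  z <- gen.arma.wge(100, phi=ar.z$phi, vara=ar.z$vara)
  plotts.sample.wge(fit$coefficients[1] + fit$coefficients[2]*(1:100) + z, arlimits=T)
  plotts.sample.wge(df$low, arlimits=T)
}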

Signal-Plus-Noise Forecast ASE
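The forecast chunk itself is suppressed; below is a minimal sketch of how we assume SIGNOISE_ASE was computed, using tswge’s signal-plus-noise forecaster to hold out and predict the final trading week:

# assumed forecast/ASE computation (sketch): predict the last 5 points from the
# linear signal-plus-noise model, then average the squared errors
f <- fore.sigplusnoise.wge(df$low, linear=TRUE, n.ahead=5, lastn=TRUE)
SIGNOISE_ASE <- mean((df$low[(length(df$low)-4):length(df$low)] - f$f)^2)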

SIGNOISE_ASE # 0.133713
## [1] 0.133713

ARIMA(p,d,q) Model

In the ARIMA model, we selected a forecast horizon of five trading days because this timeframe completes a full business week. Models can be fully re-developed on non-trading days. However, unless there are two unit roots, an ARIMA forecast will not carry enough autocorrelation from the realization for a forecasted trend to continue; consequently, it has no option but to converge toward the mean. Because of the lack of a seasonal component, this model may not fit the data especially well. However, as seen with the Signal-Plus-Noise model, the realization’s strong signal and weak noise components will work well for an autoregressive/moving average model that tends toward the mean, since the mean level appears to guide the primary growth structure of the share price over time.

Because of the strong, positive signal of the price over time, the slowly damping ACFs, and the strongest factor in the series being approximately \((1-B)\), we differenced the data once to stationarize the realization. Following this, we identified the differenced data as best represented by an \(AR(5)\) component based on Akaike Information Criterion (AIC) scores. While this was not the top model selected by the algorithm, we tested several lag orders and determined this was the best-suited selection, and that there was not enough noise in the data to benefit from adding an MA term. An \(ARMA(0,0)\) - practically, white noise - was selected by the Bayesian Information Criterion score, but because AIC did not suggest this and we identified enough noise to justify modeling, we proceeded with an \(AR(5)\) component for the ARMA model.

The estimated residuals from the fitted \(ARMA(5,0)\) model at \(K=24\) lags produced a p-value\(=0.9405\), which did not allow us to reject the null hypothesis that the distribution of the model’s residuals is close enough to white noise that we cannot effectively distinguish the difference. Low variance (0.1173) also aided in our conclusion that the model represents the data well. We then attempted to identify a new model for the residuals using AIC scores, but the top model provided was an \(ARMA(0,0)\). This concluded our model selection analysis for this model.

Finally, we generated realizations using 99 data points (99 because differencing removed one from the original 100 points) and the estimated model parameters. The spectral density and autocorrelation function plots appeared to match closely to those of the original realization.

In forecasting with the estimated \(ARIMA(5,1,0)\), the results were well placed, with an Average Squared Error (ASE) of 0.2026529. While the model performed well over a 5-point forecast, it would most likely tend back toward the long-run mean with a larger forecast horizon.

Estimated ARMA Factor Table with Model Output

First, we examine the estimated polynomial factors of the realization to determine differencing requirements and whether seasonal components are present.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
###########################################################
########################################################### ARMA/ARIMA
###########################################################

#factor.wge(df$low)
est.arma.wge(df$low, p=8, q=3) # the factor table for df$low provides one (1-B) represented as (1-0.9947B), a close-enough 
## 
## Coefficients of Original polynomial:  
## -0.3746 0.7725 0.3401 -0.1343 -0.0358 0.3197 0.1747 -0.0838 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-0.9947B              1.0054               0.9947       0.0000
## 1+1.6828B+0.7314B^2   -1.1504+-0.2091i      0.8552       0.4714
## 1+0.9128B+0.6393B^2   -0.7139+-1.0269i      0.7996       0.3467
## 1-0.9172B+0.5823B^2    0.7875+-1.0474i      0.7631       0.1474
## 1-0.3093B              3.2335               0.3093       0.0000
##   
## 
## $phi
## [1] -0.37456915  0.77252645  0.34011944 -0.13429170 -0.03580418  0.31974554
## [7]  0.17466434 -0.08376272
## 
## $theta
## [1] -1.5943772 -0.8563684 -0.2326950
## 
## $res
##   [1]  0.341382127 -0.339533981  0.372992944  0.045251115  0.282883734
##   [6]  0.543863341  0.027683372 -0.055313770 -0.400705173  0.236964242
##  [11] -0.066224077 -0.109657984  0.373268255 -0.028417015  0.268832218
##  [16]  0.321186364  0.139491809 -0.060923904 -0.207501945 -0.509810842
##  [21]  0.435881683  0.583222371  0.425193624  0.341428403  0.299644074
##  [26]  0.165902559  0.268444218  0.150980721  0.417357079 -0.476663592
##  [31]  0.367971194 -0.066208814  0.238830218 -0.357908092 -0.154882518
##  [36] -0.260220286 -0.126198191 -0.314164842  0.194647972  0.075843457
##  [41]  0.065674888  0.249249868  0.326605797  0.129696758  0.072656694
##  [46] -0.081003028 -0.590131520  0.281198743  0.348383312  0.735499563
##  [51] -0.244213897 -0.177292255  0.295402720 -0.397741559  0.068422622
##  [56]  0.468362947  0.658157668 -0.226977155 -0.082270354  0.152294282
##  [61] -0.788235363  0.327803117 -0.054959335 -0.071767723  0.259976113
##  [66]  0.155307265 -0.019760354  0.873508519  0.739690436  0.089012192
##  [71] -0.039686548 -0.439091020  0.389397691 -0.057910431 -0.435274770
##  [76]  0.127990892  0.132624679  0.123202670  0.370572426 -0.291439393
##  [81]  0.414131539  0.079728839  0.261378249  0.571631165 -0.110336678
##  [86]  0.186149091 -0.454963807 -0.443809259 -0.007272801  0.584743178
##  [91]  0.077730704 -1.018479869  0.311826735  0.461004571  0.090793997
##  [96] -0.174995085  0.530696568 -0.456227949  0.815178176 -0.275805473
## 
## $avar
## [1] 0.1295548
## 
## $aic
## [1] -1.803651
## 
## $aicc
## [1] -0.7413255
## 
## $bic
## [1] -1.491031
## 
## $se.phi
## [1] 0.56643405 0.22018656 0.51330869 0.16824213 0.06855007 0.02066278
## [7] 0.04665927 0.02784275
## 
## $se.theta
## [1] 0.5715739 1.2523688 0.2789225
# approximation. Therefore, we difference the data once. Preliminary evidence suggesting differencing is useful is based
# on the observed wandering and the damping sample autocorrelations. When using an overfit table, there was a factor close
# to (1-B)^2, but it was not very significant (third most significant using estimated_arma <- est.arma.wge(dftrans, p=15, q=1)),
# so we decided to only use a first difference.

There are no seasonal components, but the \((1-B)\) factor indicates differencing once would likely stationarize the data enough to model it with AR/MA terms.

Differenced data sample plots

plotts.sample.wge(dftrans, arlimits=T)

## $autplt
##  [1]  1.000000000  0.183338113 -0.087288073 -0.237238249 -0.305867805
##  [6] -0.078796894  0.093023749  0.049535162 -0.030050085 -0.018890445
## [11] -0.061602260  0.032022280 -0.006340976  0.013987908 -0.067126571
## [16] -0.018885062  0.053623779  0.087550930  0.039990694 -0.042584390
## [21]  0.046555568  0.002638270 -0.121413055 -0.170734093 -0.108245754
## [26]  0.034041133
## 
## $freq
##  [1] 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505 0.06060606
##  [7] 0.07070707 0.08080808 0.09090909 0.10101010 0.11111111 0.12121212
## [13] 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717 0.18181818
## [19] 0.19191919 0.20202020 0.21212121 0.22222222 0.23232323 0.24242424
## [25] 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929 0.30303030
## [31] 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535 0.36363636
## [37] 0.37373737 0.38383838 0.39393939 0.40404040 0.41414141 0.42424242
## [43] 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747 0.48484848
## [49] 0.49494949
## 
## $db
##  [1] -12.02192461  -3.41880995 -16.37917551  -0.65398900   0.06380267
##  [6] -10.48677972   3.84943304   0.29097292  -0.58134568  -8.41267862
## [11]   7.09269824   3.37965045  -8.18823270   4.07324873   7.41002062
## [16]  -9.96202731   2.43313338   3.66790531   4.67224868  -3.29023869
## [21] -17.80806558  -2.21306663  -3.70191891   1.22399133  -4.29135647
## [26]  -5.57149725  -4.24750077 -10.08378811   0.74404261  -3.35436014
## [31]  -8.42558674   3.90880953  -3.26777818   0.10095129   2.13744677
## [36]  -2.92553361 -10.85748651   4.00245066  -2.70327817 -17.79303064
## [41]  -7.14480011   1.35013762 -12.04069435  -3.97586777   1.44883756
## [46] -14.64216955  -1.60736693  -2.44746992  -8.43751621
## 
## $dbz
##  [1] -4.97475845 -4.07716660 -2.97303042 -1.88591192 -0.90492747
##  [6] -0.04471521  0.70766517  1.36795146  1.94443294  2.43681610
## [11]  2.83966838  3.14613100  3.34924324  3.44076747  3.40932510
## [16]  3.24009052  2.91739090  2.43026934  1.78033525  0.99095904
## [21]  0.11568360 -0.75978064 -1.53098712 -2.10584743 -2.43513704
## [26] -2.51843725 -2.38998876 -2.10596628 -1.73599851 -1.35057797
## [31] -1.00684070 -0.74110380 -0.57038660 -0.49899583 -0.52499087
## [36] -0.64274874 -0.84027113 -1.09295953 -1.35909561 -1.58511918
## [41] -1.72623996 -1.77392848 -1.76698265 -1.77315401 -1.85557632
## [46] -2.04429723 -2.31905385 -2.60453070 -2.79063339

AIC5 model identification

aic5.wge(dftrans, type="aic")
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Five Smallest Values of  aic
##       p    q        aic
## 17    5    1  -2.001944
## 13    4    0  -1.985397
## 14    4    1  -1.966194
## 16    5    0  -1.966140
## 10    3    0  -1.934893
# top AIC = -2.001944 at p=5, q=1; we proceed with p=5, q=0 (AIC -1.966140), as discussed above

AR(p) model estimation

estimated_arma <- est.arma.wge(dftrans, p=5, q=0)
## 
## Coefficients of Original polynomial:  
## 0.1227 -0.1183 -0.1454 -0.2587 -0.0292 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.0644B+0.6446B^2    0.8257+-0.9325i      0.8028       0.1347
## 1+0.8220B+0.3777B^2   -1.0882+-1.2098i      0.6146       0.3666
## 1+0.1198B             -8.3479               0.1198       0.5000
##   
## 
estimated_arma
## $phi
## [1]  0.12269667 -0.11827203 -0.14543411 -0.25874556 -0.02916165
## 
## $theta
## [1] 0
## 
## $res
##  [1] -0.289245266  0.317654288 -0.016022173  0.105702188  0.490654708
##  [6]  0.059592764 -0.089170202 -0.436866951  0.190821096 -0.183826752
## [11] -0.133053542  0.222431566 -0.099125244  0.174217625  0.305999693
## [16]  0.118168742 -0.100188633 -0.215896791 -0.620157196  0.303666324
## [21]  0.530759553  0.337766511  0.340491303  0.267145127  0.171745912
## [26]  0.289766770  0.221495923  0.333815247 -0.450984386  0.211100639
## [31] -0.097746981  0.102448378 -0.358014962 -0.335244979 -0.339069979
## [36] -0.349161730 -0.359020044 -0.088198838  0.053277680 -0.181713351
## [41]  0.266705727  0.150095891  0.178702354 -0.041332764 -0.047054007
## [46] -0.754715812  0.214445674  0.199088415  0.658590052 -0.280769931
## [51] -0.266009493  0.198627416 -0.453577533  0.023190343  0.328035692
## [56]  0.595110152 -0.348888727 -0.079268731  0.006473611 -0.826650827
## [61]  0.209048135 -0.162479598 -0.269408033  0.121848632  0.048802929
## [66] -0.154708525  0.827601048  0.777977575  0.025396829  0.052379905
## [71] -0.545508434  0.301407721 -0.107113538 -0.515549075 -0.104692651
## [76]  0.004773155 -0.058183207  0.291514073 -0.335326329  0.233306466
## [81]  0.078255747  0.138058931  0.529460586 -0.163958610  0.114867197
## [86] -0.552392805 -0.523539737 -0.231000679  0.485342239 -0.095980338
## [91] -1.136692032  0.096283333  0.282870154  0.029885277 -0.230371084
## [96]  0.398326768 -0.604156803  0.710349968 -0.248876449
## 
## $avar
## [1] 0.1240151
## 
## $aic
## [1] -1.96614
## 
## $aicc
## [1] -0.9335058
## 
## $bic
## [1] -1.80886
## 
## $se.phi
## [1] 0.01010279 0.01019390 0.01053728 0.01060975 0.01103595
## 
## $se.theta
## [1] 0

ARIMA(p,d,q) in Backshift Notation, Unfactored & Factored:

In factored form: \((1-B)(1-0.12269667B+0.11827203B^2+0.14543411B^3+0.25874556B^4+0.02916165B^5)(X_t - 38.53802) = a_t\), where \(\hat{\sigma}_{a}^2 = 0.1240151\)

Multiplied out: \((1-1.12269667B+0.24096870B^2+0.02716208B^3+0.11331145B^4-0.22958391B^5-0.02916165B^6)(X_t - 38.53802) = a_t\), where \(\hat{\sigma}_{a}^2 = 0.1240151\)
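As a quick sanity check on the multiplied-out coefficients, the two polynomial factors can be expanded numerically in base R; convolve() with type="open" multiplies coefficient vectors:

# multiply (1 - B) by the AR(5) characteristic polynomial 1 - phi(B);
# the result reproduces the unfactored coefficients above
ar_poly   <- c(1, -estimated_arma$phi)  # 1 - 0.1227B + 0.1183B^2 + ...
diff_poly <- c(1, -1)                   # (1 - B)
round(convolve(diff_poly, rev(ar_poly), type="open"), 8)
# 1.00000000 -1.12269667 0.24096870 0.02716208 0.11331145 -0.22958391 -0.02916165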

Quantifying Residuals of ARIMA(5,1,0)

The Ljung-Box test indicates the residuals are very similar to white noise, suggesting a potentially good fit by the ARIMA(p,d,q) model.

#estimated_arma <- est.arma.wge(dftrans, p=15, q=1)
ljung.wge(estimated_arma$res, p=5, q=0) # suggests there is no serial correlation in the residuals of the model; this is a good fit.
## Obs -0.001755173 -0.0006645789 -0.03189077 -0.0558602 -0.03745613 -0.006650248 -0.07239689 -0.1447362 -0.01711384 -0.08391131 0.05274902 -0.02483137 0.045087 -0.06707539 -0.004246384 0.07628534 0.08560867 -0.0009574519 -0.1290798 0.02213024 0.01599169 -0.07501175 -0.1093225 -0.02137378
## $test
## [1] "Ljung-Box test"
## 
## $K
## [1] 24
## 
## $chi.square
## [1] 11.57463
## 
## $df
## [1] 19
## 
## $pval
## [1] 0.9029941
estimated_arma$avar # 0.1240151
## [1] 0.1240151
mean(df$low) # 38.53802
## [1] 38.53802
# (1-B)(1-0.12269667B+0.11827203B^2+0.14543411B^3+0.25874556B^4+0.02916165B^5)*(X_t - 38.53802) = a_t, (sigma-hat_a)^2 = 0.1240151
# multiplied out: (1-1.12269667B+0.24096870B^2+0.02716208B^3+0.11331145B^4-0.22958391B^5-0.02916165B^6)*(X_t - 38.53802) = a_t, (sigma-hat_a)^2 = 0.1240151

Plotting Residuals of ARIMA(p,d,q)

######
###### Plotting residuals

# the residuals of the model do not appear correlated. The sample autocorrelation and partial autocorrelation plots indicate 
# stationarity across all lags with all lags measured within the 5% limits
plotts.sample.wge(estimated_arma$res, arlimits=T)

## $autplt
##  [1]  1.0000000000 -0.0017551730 -0.0006645789 -0.0318907670 -0.0558601955
##  [6] -0.0374561274 -0.0066502482 -0.0723968930 -0.1447362269 -0.0171138352
## [11] -0.0839113138  0.0527490212 -0.0248313665  0.0450869990 -0.0670753909
## [16] -0.0042463842  0.0762853413  0.0856086721 -0.0009574519 -0.1290798025
## [21]  0.0221302437  0.0159916891 -0.0750117483 -0.1093225110 -0.0213737804
## [26]  0.0330170662
## 
## $freq
##  [1] 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505 0.06060606
##  [7] 0.07070707 0.08080808 0.09090909 0.10101010 0.11111111 0.12121212
## [13] 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717 0.18181818
## [19] 0.19191919 0.20202020 0.21212121 0.22222222 0.23232323 0.24242424
## [25] 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929 0.30303030
## [31] 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535 0.36363636
## [37] 0.37373737 0.38383838 0.39393939 0.40404040 0.41414141 0.42424242
## [43] 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747 0.48484848
## [49] 0.49494949
## 
## $db
##  [1]  -8.833455090   0.263787556 -11.498115493   2.117874832   2.272952545
##  [6]  -9.309718287   5.018377394   0.289213348  -1.892315270 -11.308615271
## [11]   3.725194391  -0.611197010 -12.632295451  -0.590314051   3.046760336
## [16] -12.371079560   0.004883422   2.510605713   4.337557356  -3.178637232
## [21] -17.271334056  -1.077447786  -2.257551049   3.231804970  -1.965461358
## [26]  -3.043185676  -1.825761971  -7.420727112   2.312103352  -1.619957827
## [31]  -7.841066943   5.001502888  -2.103213653   0.521620232   2.860045258
## [36]  -2.911236603  -9.089689687   4.807535654  -1.618776046 -13.746143879
## [41]  -6.930503276   3.442666207  -8.907109791  -0.989371943   3.916436743
## [46] -14.309350664   0.883844166   0.342804251  -6.422777409
## 
## $dbz
##  [1] -2.1421139616 -1.5268677158 -0.7764179465 -0.0840477881  0.4450188370
##  [6]  0.7709948539  0.8934228747  0.8371033041  0.6469486934  0.3863575660
## [11]  0.1329634098 -0.0354873769 -0.0674530562  0.0372285907  0.2260959001
## [16]  0.4205848653  0.5490294530  0.5673324474  0.4653779820  0.2642966345
## [21]  0.0072262109 -0.2555822359 -0.4825978402 -0.6484742078 -0.7390119560
## [26] -0.7427274437 -0.6522279590 -0.4757583702 -0.2446363728 -0.0055391248
## [31]  0.1965782737  0.3316246748  0.3873844148  0.3649753175  0.2727409570
## [36]  0.1240460268 -0.0586583169 -0.2366934318 -0.3546826477 -0.3582603902
## [41] -0.2277399300 -0.0007442644  0.2413219942  0.4108729200  0.4464568416
## [46]  0.3298807324  0.0947313908 -0.1721868052 -0.3517455314

ACF and PACF Plots for ARIMA(p,d,q) Residuals

par(mfrow=c(3,1))
plot(estimated_arma$res, ylab="ARIMA Residuals", xlab = "Trading Days", main = "ARIMA(5,1,0) Model Residuals over Time", lwd=2, type="l")
abline(h=mean(estimated_arma$res), col="blue")
acf(estimated_arma$res) # this seems to be white noise; all lags are within the 5% limits
pacf(estimated_arma$res) # this seems to be white noise; all lags are within the 5% limits

Estimated White Noise Model for ARIMA(p,d,q) Residuals

aic5.wge(estimated_arma$res, type="aic") # White noise is the top model selected for the residuals by AIC (ARMA(0,0))
## ---------WORKING... PLEASE WAIT... 
## 
## 
## Error in aic calculation at 0 0 
## Error in aic calculation at 0 1 
## Error in aic calculation at 0 2 
## Error in aic calculation at 1 0 
## Error in aic calculation at 1 1 
## Error in aic calculation at 1 2 
## Error in aic calculation at 2 0 
## Error in aic calculation at 2 1 
## Error in aic calculation at 2 2 
## Error in aic calculation at 3 0 
## Error in aic calculation at 3 1 
## Error in aic calculation at 3 2 
## Error in aic calculation at 4 0 
## Error in aic calculation at 4 1 
## Error in aic calculation at 4 2 
## Error in aic calculation at 5 0 
## Error in aic calculation at 5 1 
## Error in aic calculation at 5 2 
## Five Smallest Values of  aic
##      p    q        aic
## 1    0    0     999999
## 2    0    1     999999
## 3    0    2     999999
## 4    1    0     999999
## 5    1    1     999999
######
######

######
###### Compare estimated model to differenced data

ARIMA(p,d,q) Model Comparison

After estimating the model we will use for forecasting, we generated 99 random data points (99 because the 100 data points were differenced once, removing a point) using a non-seasonal ARIMA(p,d,q). We then compared the ACF and Spectral Density plots to those of the original data. A match on these would indicate a potentially good fit.

Estimated Model:

# We generated a realization of 99 data points using the AR(5) estimated from the
# differenced data, with the variance set to its estimate (estimated_arma$avar = 0.1240151)
est_mod <- gen.aruma.wge(99, phi=estimated_arma$phi, theta=estimated_arma$theta, d=1, vara=estimated_arma$avar, sn=40)

plotts.sample.wge(est_mod, arlimits=T) # estimated model

## $autplt
##  [1] 1.0000000 0.9319568 0.8616508 0.8115232 0.7837261 0.7798462 0.7686385
##  [8] 0.7540206 0.7238039 0.6950334 0.6636351 0.6243891 0.5993385 0.5769297
## [15] 0.5519124 0.5366042 0.5296241 0.5194643 0.4818245 0.4413453 0.3916907
## [22] 0.3539872 0.3181363 0.2818305 0.2584268 0.2245354
## 
## $freq
##  [1] 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505 0.06060606
##  [7] 0.07070707 0.08080808 0.09090909 0.10101010 0.11111111 0.12121212
## [13] 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717 0.18181818
## [19] 0.19191919 0.20202020 0.21212121 0.22222222 0.23232323 0.24242424
## [25] 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929 0.30303030
## [31] 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535 0.36363636
## [37] 0.37373737 0.38383838 0.39393939 0.40404040 0.41414141 0.42424242
## [43] 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747 0.48484848
## [49] 0.49494949
## 
## $db
##  [1]  15.4662467   6.4612009   2.8156959  -6.5831271  -0.3491412
##  [6]   0.6467012 -23.4462233 -11.9860652  -4.2541616  -2.3339334
## [11]  -4.9973000  -0.5165929  -8.4672734  -5.9260513 -12.5070338
## [16]  -1.5329287  -9.4007029  -4.2193607 -12.8919484 -17.0150048
## [21]  -6.9836378 -14.0339519  -5.8623391 -15.3945590 -11.2233206
## [26] -24.0796617 -10.8749229 -18.3577480 -11.7127985  -7.5363064
## [31] -14.6647063 -14.4603439 -11.9075717 -18.9642429 -19.7115192
## [36] -13.9330289 -14.7859997 -12.3169328 -15.2897638 -15.2611997
## [41]  -7.5246856 -25.4508518 -13.1571835 -18.5429171 -14.5183414
## [46] -11.9244393 -15.4696702 -20.2139032 -16.6394603
## 
## $dbz
##  [1]  10.3894114   9.7014613   8.5461122   6.9166900   4.8252742
##  [6]   2.3519955  -0.2434927  -2.4200960  -3.6721480  -4.1238920
## [11]  -4.2332485  -4.2725076  -4.3507834  -4.5391300  -4.8913942
## [16]  -5.4201345  -6.0871771  -6.8234956  -7.5683596  -8.2978154
## [21]  -9.0157797  -9.7227007 -10.4014424 -11.0315750 -11.6029838
## [26] -12.1026903 -12.4960138 -12.7466330 -12.8695657 -12.9469286
## [31] -13.0762896 -13.3067957 -13.6115217 -13.9016587 -14.0770531
## [36] -14.0895034 -13.9687564 -13.7936668 -13.6492815 -13.6034164
## [41] -13.6994736 -13.9536233 -14.3531696 -14.8590436 -15.4153465
## [46] -15.9631305 -16.4485233 -16.8195565 -17.0238291

Original Data:

plotts.sample.wge(df$low, arlimits=T) # original data, for comparison with the estimated model above

## $autplt
##  [1] 1.0000000 0.9542096 0.8987323 0.8524345 0.8100849 0.7813310 0.7600881
##  [8] 0.7371705 0.7123568 0.6828689 0.6476693 0.6153474 0.5864842 0.5600570
## [15] 0.5282542 0.4955941 0.4660246 0.4343222 0.4026243 0.3675647 0.3278754
## [22] 0.2900614 0.2568547 0.2354177 0.2239847 0.2178967
## 
## $freq
##  [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14
## [15] 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
## [29] 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42
## [43] 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50
## 
## $db
##  [1]  14.2537148   9.9541873   5.8173014  -0.5437278  -0.8235314
##  [6]  -1.3564258   1.4197169  -1.1176687  -7.3533813  -3.8782332
## [11]  -0.5445071  -3.3199576  -5.9250727  -5.4172047  -2.9311702
## [16] -11.1750831  -6.3519213 -18.5926329 -18.7079883  -8.4217804
## [21]  -9.3402655  -9.6710352  -9.7031584  -8.2549030 -14.1830462
## [26] -11.3498316 -12.9743644 -12.1531178 -16.0829501  -9.7837888
## [31] -12.9010407  -9.5737077 -12.8850744 -15.0135304 -10.3417581
## [36] -16.6716427 -13.1487825 -11.0083985 -16.2745553 -18.6304870
## [41] -17.7423184 -13.7665082 -16.0810139 -16.9046379 -13.1224062
## [46] -20.5533807 -18.2518162 -14.7903000 -19.2395748 -19.0265205
## 
## $dbz
##  [1]  10.6247182   9.9050329   8.7000676   7.0111987   4.8698055
##  [6]   2.3951954  -0.1075597  -2.1509930  -3.4458827  -4.1937204
## [11]  -4.7075065  -5.1324715  -5.5423563  -6.0346253  -6.7027509
## [16]  -7.5765092  -8.5979157  -9.6409675 -10.5694523 -11.2993670
## [21] -11.8169224 -12.1628463 -12.4196726 -12.6865263 -13.0293947
## [26] -13.4394270 -13.8345951 -14.1178754 -14.2557593 -14.2960188
## [31] -14.3127453 -14.3564021 -14.4497949 -14.6054534 -14.8345314
## [36] -15.1426627 -15.5243385 -15.9631650 -16.4335783 -16.8977220
## [41] -17.3041080 -17.6055003 -17.7946958 -17.9201902 -18.0549021
## [46] -18.2464892 -18.4904595 -18.7368379 -18.9191710 -18.9861139

We then repeated the plots to assess multiple random generations of the estimated model. The output has been suppressed here, but is displayed in the appendix.

ARIMA(p,d,q) Forecast ASE

######
###### Forecast and Measure Average Squared Errors (ASE)
# Model Forecast and ASE
# Forecasting the differenced data using the parameters estimated from it (we do not apply a d=1)
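# One plausible reconstruction (hypothetical; the actual chunk was suppressed):
# forecast the last 5 differenced points with the AR(5) fit, then average the
# squared errors against the held-out values
fcst <- fore.arma.wge(dftrans, phi=estimated_arma$phi, n.ahead=5, lastn=TRUE)
ARIMA_ASE <- mean((dftrans[(length(dftrans)-4):length(dftrans)] - fcst$f)^2)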
ARIMA_ASE# 0.2026529
## [1] 0.2026529
######
######

Multivariate Regression Time Series Modeling

Following the univariate modeling of this time series, we decided to enhance our model with multivariate factors, both additive and multiplicative. As previously mentioned, based on the candlestick plot we created for exploratory data analysis, we were interested in exploring the relationship of the ranges of open to close and high to low prices as related to the low price. The features we engineered for this aided in describing the data and enhancing the models.

The multivariate models applied include Vector Autoregressive and Neural Network Multi-Layer Perceptron (MLP) models, in addition to an ensemble model combining the forecasts of the univariate Signal-Plus-Noise model and the multivariate MLP model. The ASE scores for the multivariate models outperformed those of all univariate models. However, the ensemble of Signal-Plus-Noise and MLP provides a safeguard against overfitting by a multivariate model, given its dependency on additional variables that can change significantly very quickly, disrupting a complex forecast.

Vector Autoregressive Modeling

Vector Autoregressive Modeling without Trend

The Vector Autoregressive (VAR) models used take all 95 points of the training data - including the exogenous variables - to build a model. We forecasted 5 data points ahead from this training set of 95 out of 100 data points and then generated the overall ASE and the 5-point rolling window ASE for model comparison.

The autoregressive model identified is an AR(2) with an AIC score of -2.4212.

We then performed lag analysis using cross-correlation plots to determine whether variables should be lagged. There did not appear to be strong evidence suggesting variables should be lagged with respect to the response. Furthermore, there did not appear to be a gross violation of the limits within the ACFs either, suggesting the data is reasonably stationary. This information is useful for all downstream multivariate forecasting.

# First Variable has the "time shift"
par(mfrow=c(3,1))
ccf(df$volume,df$low, type="correlation") # nothing too significant to warrant lagging volume
ccf(df$HiLo,df$low, type="correlation") # no strong cross-correlation
ccf(df$OpClo,df$low, type="correlation") # Open-Close does not require lagging.

The first VAR model recommended an AR(5) component

library(vars) # VARselect(), VAR()

df2 <- df[1:(nrow(df)-5),]
VARselect(df2$low, lag.max=6, season=NULL, exogen=data.frame(df2$volume, df2$HiLo, df2$OpClo), type="const") # AIC picked p=5 (AIC -2.4366); HQ and SC picked p=2, which we fit below
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      5      2      2      5 
## 
## $criteria
##                  1           2           3           4          5
## AIC(n) -2.32345681 -2.42117979 -2.41376593 -2.40362716 -2.4365656
## HQ(n)  -2.26710299 -2.35355521 -2.33487058 -2.31346105 -2.3351287
## SC(n)  -2.18364578 -2.25340655 -2.21803048 -2.17992951 -2.1849057
## FPE(n)  0.09794606  0.08883496  0.08950683  0.09043349  0.0875214
##                  6
## AIC(n) -2.42032626
## HQ(n)  -2.30761863
## SC(n)  -2.14070419
## FPE(n)  0.08897736

The ASE for this model is 0.2283

lsfit <- VAR(y=cbind(df2$low, df2$volume, df2$HiLo, df2$OpClo), p=2, type="const") # with p=2, this will estimate the coefficients through lag 2

preds <- predict(lsfit, n.ahead=5)

ASE <- mean((df$low[(nrow(df)-4):nrow(df)] - preds$fcst$y1[,1])^2)
ASE # 0.2282559
## [1] 0.2282559

Plotting the forecast

plot(seq(1,100,1), df$low[1:100], type = "l", xlim = c(0,100), xlab = "Trading Days", ylab = "Share Prices", lwd=2)
lines(seq(96,100,1), preds$fcst$y1[,1], type = "l", col = "red", lwd=2)

Vector Autoregressive Modeling with Trend

In the VAR model with trend, we introduce a time component into the model as an additive predictor term. The model identified is an AR(2) with an AIC score of -2.499.

df2 <- df[1:(nrow(df)-5),] 
t <- 1:100
# Because of the white noise cross-correlation between all exogenous variables and the response variable, it's reasonable to expect only an AR(1) from VARselect. The model will quickly approximate the signal's mean without much sensitivity to change.
VARselect(df2$low, lag.max=6, season=NULL, exogen=data.frame(t[1:95], df2$volume, df2$HiLo, df2$OpClo), type="const") # AIC picked p=2 with AIC -2.4985
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      2      2      2      2 
## 
## $criteria
##                 1           2           3           4           5
## AIC(n) -2.3288269 -2.49848113 -2.47601445 -2.45354442 -2.45473227
## HQ(n)  -2.2612024 -2.41958579 -2.38584835 -2.35210755 -2.34202465
## SC(n)  -2.1610537 -2.30274568 -2.25231680 -2.20188456 -2.17511021
## FPE(n)  0.0974299  0.08223654  0.08411857  0.08604793  0.08596807
##                  6
## AIC(n) -2.45522002
## HQ(n)  -2.33124163
## SC(n)  -2.14763575
## FPE(n)  0.08595343

The ASE for this model is 0.1132

lsfit <- VAR(y=cbind(df2$low, t[1:95], df2$volume, df2$HiLo, df2$OpClo), p=1, type="const") # with p=1, this will estimate the coefficients for lag 1
preds <- predict(lsfit, n.ahead=5)

ASE <- mean((df$low[(nrow(df)-4):nrow(df)] - preds$fcst$y1[,1])^2)
ASE # 0.1132188
## [1] 0.1132188

Plotting the forecast

plot(seq(1,100,1), df$low[1:100], type = "l",xlim = c(0,100), xlab = "Trading Days", ylab = "Share Prices", main = "Vector Autoregressive Modeling with Trend", lwd=2)
lines(seq(96,100,1), preds$fcst$y1[,1], type = "l", col = "red", lwd=2)

Vector Autoregressive Modeling with Trend, Trend and Volume Interaction

Our VAR model with additive terms for trend, volume, and the spreads between high and low and open and close prices, in addition to a multiplicative term for trend and its impact on volume, outperformed all other VAR models we built in terms of ASE. Adding the multiplicative term reduced the ASE to roughly half, from 0.2139 to 0.1296. The autoregressive lag order identified for the model is an AR(2) with an AIC score of -2.4780. This model includes two univariate lags on the univariate input nodes one and two with four exogenous regressor lags across regressor nodes one and two.

df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )

The model identified is an AR(2) with an AIC score of -2.478.

df2 <- df[1:(nrow(df)-5),]
t <- 1:100
# Because of the white noise cross-correlation between all exogenous variables and the response variable, it's reasonable to expect only an AR(1) from VARselect. The model will quickly approximate the signal's mean without much sensitivity to change.
#VARselect(df2$low, lag.max=6, type="const", season=NULL, exogen=data.frame(t[96:100], df2$volume, df2$HiLo, df2$OpClo, t[96:100]*df2$volume)) # AIC:  p=1 with AIC -2.3158

# trend added manually:
VARselect(df2$low, lag.max=6, season=NULL, exogen=data.frame(t[1:95], df2$volume, df2$HiLo, df2$OpClo, t[1:95]*df2$volume), type="const") # AIC picked p=2 with AIC -2.4780
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      2      2      2      2 
## 
## $criteria
##                  1           2         3           4           5
## AIC(n) -2.30801515 -2.47800080 -2.455533 -2.43306769 -2.43298718
## HQ(n)  -2.22911981 -2.38783470 -2.354096 -2.32036006 -2.30900879
## SC(n)  -2.11227971 -2.25430315 -2.203873 -2.15344563 -2.12540291
## FPE(n)  0.09949085  0.08395165  0.085877  0.08785085  0.08788582
##                  6
## AIC(n) -2.43521003
## HQ(n)  -2.29996088
## SC(n)  -2.09966355
## FPE(n)  0.08772417

The ASE for this model is 0.130

# trend added manually:
lsfit <- VAR(y=cbind(df2$low, t[1:95], df2$volume, df2$HiLo, df2$OpClo, t[1:95]*df2$volume), p=1, type="const") # with p=1, this will estimate the coefficients for lag 1
preds <- predict(lsfit, n.ahead=5)

ASE <- mean((df$low[(nrow(df)-4):nrow(df)] - preds$fcst$y1[,1])^2)
ASE # 0.1296242
## [1] 0.1296242

Plotting the forecast

plot(seq(1,100,1), df$low[1:100], type = "l",xlim = c(0,100), xlab = "Trading Days", ylab = "Share Prices", main = "Vector Autoregressive Modeling with Trend and Volume Interaction", lwd=2)
lines(seq(96,100,1), preds$fcst$y1[,1], type = "l", col = "red", lwd=2)

Vector Autoregressive Modeling with Trend, Time and Volume Interaction, Time and OpClo Interaction

Finally, we modeled an interaction between trend and volume and between trend and the range between open and close prices (we also attempted an interaction between trend and the range between high and low prices, but this hurt the model, potentially indicating there is greater variation between the high and low prices than between the open and close prices). Nonetheless, the resulting forecast ASE was higher than that of the multiplicative model including only the interaction between trend and trade volume. Therefore, the latter model is the one we selected as the top VAR candidate. As with the other VAR models, this model applied an identified second-order autoregressive component, with an AIC score of -2.4645.

The model identified is an AR(2) with an AIC score of -2.465.

df2 <- df[1:(nrow(df)-5),]
t <- 1:100
# Because of the white noise cross-correlation between all exogenous variables and the response variable, it's reasonable to expect only an AR(1) from VARselect. The model will quickly approximate the signal's mean without much sensitivity to change.

# find a good count for p for the AR(p) component of the vector autoregressive model. No indication above for lag was required, but :
VARselect(df2$low, lag.max=8, season=NULL, exogen=data.frame(t[1:95], df2$volume, df2$HiLo, df2$OpClo, t[1:95]*df2$volume, t[1:95]*df2$OpClo), type="const") # selects p=2 as the best value for AR(p)
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      2      2      2      2 
## 
## $criteria
##                 1           2           3           4           5
## AIC(n) -2.2713589 -2.46459045 -2.44187917 -2.41889398 -2.41500677
## HQ(n)  -2.1800536 -2.36187199 -2.32774754 -2.29334919 -2.27804882
## SC(n)  -2.0446087 -2.20949651 -2.15844146 -2.10711249 -2.07488151
## FPE(n)  0.1032257  0.08510686  0.08708604  0.08914122  0.08952502
##                  6           7           8
## AIC(n) -2.41422092 -2.40044609 -2.37762703
## HQ(n)  -2.26584981 -2.24066181 -2.20642959
## SC(n)  -2.04575189 -2.00363329 -1.95247046
## FPE(n)  0.08963886  0.09093375  0.09309378

The ASE for this model is 0.170.

# Build the model on the training data; type="const" because trend is included
lsfit <- VAR(y=cbind(df2$low, t[1:95], df2$volume, df2$HiLo, df2$OpClo, t[1:95]*df2$volume, t[1:95]*df2$OpClo), p=1, type="const") # p=1 estimates the coefficients at lag 1

preds <- predict(lsfit, n.ahead=5)

ASE <- mean((df$low[(nrow(df)-4):nrow(df)] - preds$fcst$y1[,1])^2)
ASE # 0.1703138
## [1] 0.1703138

Plotting the forecast

plot(seq(1,100,1), df$low[1:100], type = "l",xlim = c(0,100), xlab = "Trading Days", ylab = "Share Prices", main = "Vector Autoregressive Modeling with Trend, Volume and OpClo Interactions", lwd=2)
lines(seq(96,100,1), preds$fcst$y1[,1], type = "l", col = "red", lwd=2)

Neural Network Models: Multi-Layered Perceptrons

A multi-layered perceptron is a “feedforward” artificial neural network: information passes forward from the input nodes through a hidden layer of nodes that apply activation functions before reaching the output node. In this analysis, lags were detected and applied on input to both the predictor regressors and the univariate response.
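
nnfor’s mlp selects these lags automatically, but they can also be fixed by hand through its lags and xreg.lags arguments (a lag vector for the response and a list of lag vectors for the regressors, as we understand the interface). A minimal sketch, reusing the series constructed in the chunks below:

fit.manual <- mlp(lowTrain, lags = c(1,2), # autoregressive lags of the response
                  xreg = data.frame(volumeFull, HiLoFull, OpCloFull),
                  xreg.lags = list(c(1,2), c(1,2), c(1,2)), # one lag vector per regressor
                  reps = 15, comb = "mean")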

When adding additional predictor variables to the multivariate models, the forecasts converge less toward the long-run mean and more toward the behavior of the realization itself. Therefore, adding useful predictor variables - such as sentiment scores derived from natural language processing - can further stabilize multivariate models, as sketched below.
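
For instance, a daily sentiment score would enter as simply one more exogenous column. A hypothetical sketch (sentimentFull is an assumed NLP-derived series, not part of this dataset):

sentimentFull <- ts(df$sentiment) # hypothetical sentiment score column
fit.sent <- mlp(lowTrain, xreg=data.frame(volumeFull, HiLoFull, OpCloFull, sentimentFull), reps = 15, comb = "mean", hd.auto.type="cv")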

Neural Network without Trend

The Neural Network without trend produced the highest Average Squared Error of all our MLP models. However, it remained highly competitive with the models from the other families. The ASE for this model is 0.1758194.

MLP node configuration

dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg

fit.mlp <- mlp(lowTrain, xreg=data.frame(volumeFull, HiLoFull, OpCloFull), reps = 15, comb = "mean", hd.auto.type="cv")

MLP map

fit.mlp
## MLP fit with 6 hidden nodes and 15 repetitions.
## Univariate lags: (1,2)
## 2 regressors included.
## - Regressor 1 lags: (1,2)
## - Regressor 2 lags: (1,2,4)
## Forecast combined using the mean operator.
## MSE: 0.0382.
plot(fit.mlp)

MLP Forecast

fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(volumeFull, HiLoFull, OpCloFull))
plot(fore.mlp)

MLP ASE
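
As in the other MLP sections, the ASE is the mean squared error of the 5-step forecast against the held-out actuals:

ASE = mean((lowTest - fore.mlp$mean)^2)
ASE # 0.1758194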

Neural Network with Trend

After adding trend into our Neural Network modeling, we lowered the ASE to 0.06421939. (Because MLP training is stochastic, re-running these chunks yields slightly different values, so the rendered outputs in this and the following MLP-based sections differ slightly from the ASEs quoted in the text.)

MLP node configuration

t <- 1:nrow(df)

dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg

fit.mlp <- mlp(lowTrain, xreg=data.frame(t, volumeFull, HiLoFull, OpCloFull), reps = 15, comb = "mean", hd.auto.type="cv")
fit.mlp
## MLP fit with 1 hidden node and 15 repetitions.
## Univariate lags: (1,2)
## 3 regressors included.
## - Regressor 1 lags: (1)
## - Regressor 2 lags: (2)
## - Regressor 3 lags: (1,2)
## Forecast combined using the mean operator.
## MSE: 0.083.

MLP map

plot(fit.mlp)

MLP Forecast

fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(t, volumeFull, HiLoFull, OpCloFull))
plot(fore.mlp)

MLP ASE

ASE = mean((lowTest - fore.mlp$mean)^2)
ASE # 0.06421939
## [1] 0.06459411

Neural Network with Trend, Time and Volume Interaction

When we added a multiplicative term for trend and volume, the ASE dropped from 0.0642 to 0.0549. Although the Neural Nets that included a trend predictor were all very close in residual error, this model produced the lowest ASE.

MLP node configuration

t <- 1:nrow(df)

dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg

fit.mlp <- mlp(lowTrain, xreg=data.frame(t, volumeFull, HiLoFull, OpCloFull, t*volumeFull), reps = 15, comb = "mean", hd.auto.type="cv")
fit.mlp
## MLP fit with 5 hidden nodes and 15 repetitions.
## Univariate lags: (1,2)
## 3 regressors included.
## - Regressor 1 lags: (1)
## - Regressor 2 lags: (2)
## - Regressor 3 lags: (1,2)
## Forecast combined using the mean operator.
## MSE: 0.0599.

MLP map

plot(fit.mlp)

MLP Forecast

fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(t, volumeFull, HiLoFull, OpCloFull, t*volumeFull))
plot(fore.mlp)

MLP ASE

ASE = mean((lowTest - fore.mlp$mean)^2)
ASE # 0.05486796
## [1] 0.06564621

Neural Network with Trend, Time and Volume Interaction, Time and OpClo Interaction

An interaction between time and the spread between high and low prices did not benefit the model, while an interaction between time and the spread between open and close prices did. Even so, this model's ASE of 0.0571 was higher than the 0.0549 achieved by the model with only the trend and volume interaction.

MLP node configuration

t <- 1:nrow(df)

dfTrain <- df[1:(nrow(df)-5),]
lowTest <- df$low[(nrow(df)-4):nrow(df)] # forecast actuals
lowTrain <- ts(dfTrain$low) # a training set of the target minus the test forecast
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
High <- ts(df$high) # the raw high price (used in place of HiLo in this model's xreg)
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg

fit.mlp <- mlp(lowTrain, xreg=data.frame(t, volumeFull, High, OpCloFull, t*volumeFull, t*OpCloFull), reps = 15, comb = "mean", hd.auto.type="cv", difforder=NULL)
fit.mlp
## MLP fit with 7 hidden nodes and 15 repetitions.
## Univariate lags: (1)
## 3 regressors included.
## - Regressor 1 lags: (1)
## - Regressor 2 lags: (2)
## - Regressor 3 lags: (1,2)
## Forecast combined using the mean operator.
## MSE: 0.0654.

MLP map

plot(fit.mlp)

MLP Forecast

fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(t, volumeFull, High, OpCloFull, t*volumeFull, t*OpCloFull))
plot(fore.mlp)

MLP ASE

ASE = mean((lowTest - fore.mlp$mean)^2)
ASE # 0.05711815
## [1] 0.06303533

Rolling ASEs and Ensemble Model Selection

To further assess the best model in each category, we computed the Average Squared Error over a rolling 5-point forecast window. This shows how the model estimates change over time, which helps guard against over-fitting and was weighted heavily in this project. The procedures and output follow below.
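
The same idea in generic form: a minimal sketch of a rolling-ASE helper, where fc stands for any function mapping a training window to an h-step forecast (the per-model loops below inline this logic, and score each forecast against the final five observations of the series rather than the five points immediately after each window):

rolling_ase <- function(x, fc, batch = 95, h = 5) {
  # x: numeric series; fc: function(train, h) returning an h-step forecast vector
  ases <- numeric(length(x) - batch - h + 1)
  for (i in seq_along(ases)) {
    train <- x[i:(i + batch - 1)]                # training window
    actual <- x[(i + batch):(i + batch + h - 1)] # the h points after the window
    ases[i] <- mean((actual - fc(train, h))^2)   # average squared forecast error
  }
  ases
}

For example, fc could wrap fore.aruma.wge with lastn=FALSE so the forecast extends past the end of each window.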

When comparing all candidate models, the model with the lowest fixed ASE is the MLP, while the model with the lowest 5-point rolling ASE is the Signal-Plus-Noise. We therefore selected the ensemble of the two - the pointwise mean of their forecasts, computed in the ensemble section below - as the best model for forecasting. The ensemble keeps the flexibility of multiple predictors while the univariate signal-plus-noise component lowers the risk of over-fitting.

Rolling ASE for Neural Networks: Multi-Layered Perceptrons

#MLP
ts = df$low
batch_size = 95
num_batches = length(ts)-batch_size+1
NN_ASEs = numeric(num_batches)
volumeFull <- ts(df$volume) # the full dataset (train + test) of xreg
HiLoFull <- ts(df$HiLo) # the full dataset (train + test) of xreg
OpCloFull <- ts(df$OpClo) # the full dataset (train + test) of xreg

#Rolling ASE: train on successive 95-point windows; each 5-step forecast is scored against the final 5 observations
for (i in 0:(num_batches-1))
{
  window <- (i+1):(i+batch_size) # R is 1-indexed, so offset the window start by one
  fit.mlp <- mlp(ts(ts[window]), xreg=data.frame(volumeFull[window], HiLoFull[window], OpCloFull[window]), reps = 15, comb = "mean", hd.auto.type="cv")
  fore.mlp <- forecast(fit.mlp, h=5, xreg=data.frame(volumeFull, HiLoFull, OpCloFull))
  NN_ASEs[i+1] = mean((ts[(length(ts) - 4):length(ts)] - fore.mlp$mean)^2)
}

NN_ASEs
## [1] 0.1852031 0.1678351 0.1800357 0.1032628 0.1136498 0.0870527
mean(NN_ASEs) # 0.1468514
## [1] 0.1395065

Rolling ASE for Vector Autoregressive

df2 <- df[1:(nrow(df)-5),]
t = 1:100
trainingSize = 91
horizon = 5
ASEHolder_2 = numeric()

for( i in 1:5)
{
  
  # VARselect on each training window (AIC picked p=2 with AIC -2.44854735); we fit p=1 below as before
  VARselect(df2$low[i:(i + (trainingSize - 1))], lag.max=6, season=NULL, exogen=data.frame(t[i:(i + (trainingSize - 1))], df2$volume[i:(i + (trainingSize - 1))], df2$HiLo[i:(i + (trainingSize - 1))], df2$OpClo[i:(i + (trainingSize - 1))], ((df2$volume[i:(i + (trainingSize - 1))]) * t[i:(i + (trainingSize - 1))])), type="const")
  # note: every series enters cbind() as an endogenous variable (the exogen tag inside cbind only names columns), consistent with the earlier VAR fits; naming the response column makes the forecast easier to extract
  lsfit <- VAR(y=cbind(low = df2$low[i:(i + (trainingSize - 1))], exogen = data.frame(t[i:(i + (trainingSize - 1))], df2$volume[i:(i + (trainingSize - 1))], df2$HiLo[i:(i + (trainingSize - 1))], df2$OpClo[i:(i + (trainingSize - 1))], ((df2$volume[i:(i + (trainingSize - 1))]) * t[i:(i + (trainingSize - 1))]))), p=1, type="const")
  preds <- predict(lsfit, n.ahead=5)

  ASE_2 = mean((df$low[(trainingSize + i):(trainingSize + i + horizon - 1)] - preds$fcst$low[,1])^2)
  
  ASEHolder_2[i] = ASE_2
  
}

rolling_ASE = mean(ASEHolder_2)
rolling_ASE # 0.1869798
## [1] 0.1869798

Rolling ASE for Signal-Plus-Noise

ts = df$low
batch_size = 96
num_batches = length(ts)-batch_size+1
sigplus_ASEs = numeric(num_batches)

for (i in 0:(num_batches-1))
{
  window <- (i+1):(i+batch_size) # R is 1-indexed, so offset the window start by one
  signoise.forecast <- fore.sigplusnoise.wge(df$low[window], max.p=2, n.ahead=5, limits=T, lastn=T, plot=T)

  sigplus_ASEs[i+1] = mean((df$low[(nrow(df)-4):nrow(df)] - signoise.forecast$f)^2)
}
## 
## Coefficients of Original polynomial:  
## 1.1090 -0.3881 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.1090B+0.3881B^2    1.4288+-0.7317i      0.6230       0.0753
##   
## 

## 
## Coefficients of Original polynomial:  
## 1.1077 -0.3812 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.1077B+0.3812B^2    1.4531+-0.7156i      0.6174       0.0728
##   
## 

## 
## Coefficients of Original polynomial:  
## 1.1170 -0.3918 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.1170B+0.3918B^2    1.4254+-0.7213i      0.6260       0.0746
##   
## 

## 
## Coefficients of Original polynomial:  
## 1.0446 -0.3237 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.0446B+0.3237B^2    1.6135+-0.6971i      0.5689       0.0649
##   
## 

## 
## Coefficients of Original polynomial:  
## 1.0617 -0.3442 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.0617B+0.3442B^2    1.5423+-0.7257i      0.5867       0.0700
##   
## 

sigplus_ASEs
## [1] 0.13086079 0.15037944 0.20082465 0.08539453 0.09946937
rolling_ASE_sigplus = mean(sigplus_ASEs)
rolling_ASE_sigplus # 0.1333858
## [1] 0.1333858

Rolling ASE for ARIMA(5,1,0)

ts = df$low
batch_size = 95
start = 1
ASEs = numeric(5)

for (i in 1:5)
{
  # expanding window through point batch_size+i; with lastn = TRUE, the forecast covers the window's final 5 values
  forecasts = fore.aruma.wge(ts[start:(batch_size+i)], phi = c(0.1227,-.1183,.1454,-.2587,.029), d = 1, n.ahead = 5, lastn = TRUE, limits = FALSE)
  ASEs[i] = mean((ts[(length(ts) - 4):length(ts)] - forecasts$f)^2)
}

rolling_ASE = mean(ASEs) 

rolling_ASE #0.3935643
## [1] 0.3935643

Rolling ASE for Ensemble Model: Signal-Plus-Noise and Multi-Layered Perceptrons

#Ensemble
mlp_pred = as.numeric(fore.mlp$mean)

signoise = as.numeric(signoise.forecast$f)

en_pred = (mlp_pred + signoise)/2

ASE = mean((ts[(length(ts) - 4):length(ts)] - en_pred)^2)
ASE # 0.06966031
## [1] 0.06732777
rolling_ASEs_ensemble = (sigplus_ASEs + NN_ASEs)/2 # note: the vectors hold 5 and 6 windows, so R recycles the first signal-plus-noise ASE into the sixth entry
rolling_ASEs_ensemble
## [1] 0.15803194 0.15910728 0.19043018 0.09432865 0.10655961 0.10895674
rolling_ASE_mean_ensemble = mean((sigplus_ASEs + NN_ASEs)/2)
rolling_ASE_mean_ensemble
## [1] 0.1362357

Appendix

Additional Plots, Forecasting, & Test Results

Matching Signal-Plus-Noise Realization and Estimated Models’ Generated Spectral Density and ACFs

#########################################################################################################
############### Confirmation for repeated model visualization of ACF and Spectral Density ###############
#########################################################################################################
df <- stock_data()
## Parsed with column specification:
## cols(
##   times = col_date(format = ""),
##   open = col_double(),
##   high = col_double(),
##   low = col_double(),
##   close = col_double(),
##   volume = col_double(),
##   symbol = col_character(),
##   market = col_character()
## )
# take a sample of the data, analyze
# plotts.sample.wge(df$low, arlimits=T)

####################
# Signal-Plus-Noise
####################
x = df$low
n = length(x)
t = 1:n
fit = lm(x ~ t)

x.z <- fit$residuals # derive residuals (from the regression line)
ar.z <- aic.wge(x.z, p=0:6, type="aic")
est_mod <- gen.sigplusnoise.wge(100, b0=34.855438, b1 = 0.072922, phi=ar.z$phi, vara=0.1177843) # generate one realization from the estimated signal-plus-noise model

for (i in 1:5)
{
   SpecDen2 <- parzen.wge(gen.sigplusnoise.wge(100, b0=34.855438, b1 = 0.072922, phi=ar.z$phi, vara=0.1177843), plot = T)
   lines(SpecDen2$freq, SpecDen2$pzgram, lwd = 2, col = "red")
}

for (i in 1:5)
{
   ACF2 <- acf(gen.sigplusnoise.wge(100, b0=34.855438, b1 = 0.072922, phi=ar.z$phi, vara=0.1177843), plot = T)
   lines(ACF2$lag, ACF2$acf, lwd = 2, col = "red")
}

#########################################################################################################
#########################################################################################################
#########################################################################################################

signoise.forecast <- fore.sigplusnoise.wge(df$low, max.p=2, n.ahead=5, limits=T, lastn=T, plot=T)
## 
## Coefficients of Original polynomial:  
## 1.0507 -0.3152 
## 
## Factor                 Roots                Abs Recip    System Freq 
## 1-1.0507B+0.3152B^2    1.6667+-0.6283i      0.5614       0.0574
##   
## 

SIGNOISE_ASE = mean((df$low[(nrow(df)-4):nrow(df)] - signoise.forecast$f)^2)
SIGNOISE_ASE # 0.133713
## [1] 0.133713

Matching ARUMA(p,d,q) Realization and Estimated Models’ Generated Spectral Density and ACFs